Genome Biology
Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match Genome Biology's content profile, based on 555 papers previously published here. The average preprint has a 0.30% match score for this journal, so anything above that is already an above-average fit.
Ioannou, A.; Friman, E. T.; Daub, C. O.; Bickmore, W. A.; Biddie, S. C.
Plasma cell-free RNA (cfRNA) reflects tissue- and cell-type-specific activity across pathological states and is a promising biomarker for organ injury and disease. Computational deconvolution methods are widely used to infer organ and cell-type contributions to cfRNA profiles. However, most were originally developed for single-tissue bulk transcriptomes and their performance in body-wide cfRNA settings, where any tissue or cell type can contribute, remains poorly characterised. Here, we present a systematic benchmarking of tissue- and cell type-of-origin deconvolution for plasma cfRNA that considers both methodological and reference-related sources of variability under realistic cfRNA simulation settings. We evaluated seven commonly used deconvolution methods across distinct algorithmic classes and multi-organ reference configurations derived from bulk and single-cell atlases. We assessed performance using simulation frameworks that model multi-organ mixtures, technical noise, and transcript degradation. We further examined deconvolution methods across multiple previously published clinical cfRNA cohorts spanning diverse disease contexts. Across both tissue- and cell-type-level analyses, deconvolution performance was strongly influenced by both method choice and reference parameters. Tissue-of-origin inference was comparatively robust across simulated and clinical datasets, recovering disease-associated organ signals and concordance with biochemical markers. In contrast, cell type-of-origin inference showed greater variability and reduced consistency across analytical settings, leading to divergent interpretations in both simulations and published clinical cfRNA cohorts. Together, these findings demonstrate that methodological and reference-related variability are major sources of uncertainty in cfRNA deconvolution, with tissue-level inference being more robust than cell-type-level inference. 
Our benchmarking framework provides guidance for reference selection and comparative interpretation in cfRNA deconvolution.
Setter, D.; Lohse, K.; Baird, S. J. E.
Most ancestry-assignment methods rely on putatively pure reference panels, which are often unrealistic and bias inference. The genome polarisation algorithm diem, introduced previously, avoids reference panels by jointly inferring the polarity of common allelic states and quantifying variant diagnosticity via an expectation-maximisation procedure. Here we present diempy, an efficient Python implementation of diem coupled with tools that turn polarised calls into analysis-ready outputs. diempy offers lossless VCF-to-diem BED conversion; ploidy-aware handling of individuals and chromosomes; flexible masking of sites, regions and individuals; and interactive visualisation of polarised genomes, hybrid indices, clines and ternary plots. Post-processing functions include DI thresholding, kernel smoothing, and automatic detection and run-length encoding of contiguous ancestry tracts. BED-based I/O facilitates integration with population-genomic workflows (e.g. filtering by annotation or ploidy). These features make reference-free genome polarisation with diempy practical and reproducible for studies of population structure, admixture and species barriers.
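The run-length encoding of contiguous ancestry tracts mentioned in this abstract can be illustrated with a minimal sketch. This is not diempy's API: the function name `encode_tracts`, the integer state labels, and the tuple layout are hypothetical, chosen only to show the idea of collapsing per-site polarised states into tracts.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def encode_tracts(
    states: Sequence[int], positions: Sequence[int]
) -> List[Tuple[int, int, int, int]]:
    """Collapse per-site states into (state, start, end, n_sites) tracts.

    `states` holds one hypothetical polarised state per site and
    `positions` holds the matching genomic coordinates, in order.
    """
    if len(states) != len(positions):
        raise ValueError("states and positions must have the same length")
    tracts = []
    index = 0
    for state, run in groupby(states):
        n_sites = len(list(run))  # number of consecutive sites in this state
        start = positions[index]
        end = positions[index + n_sites - 1]
        tracts.append((state, start, end, n_sites))
        index += n_sites
    return tracts


if __name__ == "__main__":
    example_states = [0, 0, 0, 2, 2, 1, 1, 1, 1]
    example_positions = [100, 250, 400, 900, 1200, 1500, 1800, 2100, 2600]
    for tract in encode_tracts(example_states, example_positions):
        print(tract)
```

Storing one tuple per tract rather than one value per site keeps the output compact when ancestry blocks are long, which is the usual motivation for run-length encoding.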
Wells, S. B.; Shahnawaz, H.; Jones, J. L.
dreampy is a Python implementation of the R dreamlet framework for pseudobulk differential expression analysis of single-cell RNA-seq data. dreamlet combines voom precision-weighted linear mixed models with empirical Bayes moderation to handle batch effects, repeated measures, and other hierarchical structure in multi-donor studies, but exists entirely within the R/Bioconductor ecosystem. dreampy reproduces this pipeline natively in Python, integrating with AnnData and the scverse ecosystem.
Zhang, S.; Lu, Y.; Luo, Q.; An, L.
Identifying cell type-specific expressed genes (marker genes) is essential for understanding the roles and interactions of cell populations within tissues. To achieve this, traditional differential analysis approaches are often applied to individual cell-type bulk RNA-seq and single-cell RNA-seq data. However, real-world datasets often pose challenges, such as heterogeneous bulk RNA-seq and incomplete scRNA-seq. Heterogeneous bulk RNA-seq amalgamates gene expression profiles from multiple cell types and results in low resolution, while incomplete scRNA-seq does not capture some cell types from the tissue, leading to unknown cell types. Traditional methods fail to identify marker genes for such unknown cell types. MiCBuS addresses this limitation by generating Dirichlet-pseudo-bulk RNA-seq based on bulk and incomplete single-cell RNA-seq data. By performing differential analysis of gene expression on bulk and Dirichlet-pseudo-bulk RNA-seq samples, MiCBuS can identify the marker genes of unknown cell types, enabling the identification and characterization of these elusive cellular components. Simulation studies and real data analyses demonstrate that MiCBuS reliably and robustly identifies marker genes specific to unknown cell types, a capability that traditional differential analysis methods cannot achieve. Availability and implementation: MiCBuS is implemented in the R language and freely available at https://github.com/Shanshan-Zhang/MiCBuS.
Palamin, M.; Krebs, A.
Chromatin accessibility and histone post-translational modifications are widely used to identify and characterize cis-regulatory elements. Yet, these are typically measured separately, precluding direct linkage between them. Here, we present Chromatin-informed Single Molecule Footprinting (ChromSMF), a method that simultaneously quantifies histone modifications, transcription factor binding, and nucleosome occupancy, while measuring DNA methylation and sequence variations on the same multi-kilobase DNA molecules. ChromSMF combines antibody-guided tethering of the adenine methyltransferase Hia5 to modified histones, protein-DNA footprinting using the cytosine methyltransferase M.CviPI, and direct methylation detection by Nanopore sequencing. We benchmark ChromSMF across five histone modifications, uncovering relationships between epigenetic states and chromatin opening across entire cis-regulatory landscapes. We further present a computational framework for integrated analysis of multiple layers of epigenetic regulation on individual haplotypes. Together, ChromSMF provides an integrated genome-wide genomic method to investigate the combinatorial function of multiple genetic and epigenetic factors on gene regulation across diverse cellular contexts.
Pachter, L.
The edgeR Bioconductor package is one of the most widely used tools for differential expression analysis of count-based genomics data. Despite its popularity, the R-only implementation limits its integration with the Python-centric ecosystem that has become dominant in single-cell genomics. We present edgePython, a Python port of edgeR 4.8.2 that extends the framework with a negative binomial-gamma mixed model for multi-subject single-cell analysis and empirical Bayes shrinkage of cell-level dispersion.
Vahedi Torghabeh, B.; Moslemi, C.; Dybdal Jensen, J.; Hentrup, S.; Li, T.; Yu, X.; Wang, H.; Asp, T.; Ramstein, G. P.
Predicting gene expression from cis-regulatory DNA sequences at the promoter and terminator regions is a central challenge in plant genomics. This capability is also a prerequisite for assessing the effects of regulatory mutations on gene expression. Here, we developed deep learning sequence-to-expression (S2E) models that leverage context-aware sequence embeddings from the PlantCaduceus genomic language model instead of one-hot encoding of sequences, to predict gene expression across 17 plant species. To further improve predictions, we integrated chromatin accessibility data as auxiliary regulatory features. First, we evaluated our models to predict gene expression on unseen gene families via cross-validation, demonstrating that our models' prediction accuracy across all species exceeds that of PhytoExpr, the current state-of-the-art (SOTA) S2E model in plants (Pearson R=0.82 vs. R=0.74). We then validated variant effect predictions using an experimental dataset across 796 Brachypodium mutant lines, specifically designed to test predictions at single-base resolution. Our models outperformed SOTA S2E models in predicting between-gene expression differences (regression coefficient β=0.78 vs. β=0.57). Remarkably, they also accurately predicted the effects of single-nucleotide mutations on within-gene expression, while SOTA S2E models showed only weak associations (regression coefficient β=0.38 vs. β=0.08). Our results demonstrated the value of context-aware DNA sequence embeddings for predicting regulatory variant effects in plants. They also reveal a persistent accuracy gap in S2E models when moving from between-gene to allelic variation, a challenge that needs to be addressed in future S2E studies.
Thapa, S.; Samderiya, K.; Menon, R.; Oluwadare, O.
Accurate splice site prediction is fundamental to understanding gene expression and its associated disorders. However, most existing models are biased toward frequent canonical sites, limiting their ability to detect rare but biologically important non-canonical variants. These models often rely heavily on large, imbalanced datasets that fail to capture the sequence diversity of non-canonical sites, leading to high false-negative rates. Here, we present SpliceRead, a novel deep learning model designed to improve the classification of both canonical and non-canonical splice sites using a combination of residual convolutional blocks and synthetic data augmentation. SpliceRead employs a data augmentation method to generate diverse non-canonical sequences and uses residual connections to enhance gradient flow and capture subtle genomic features. Trained and tested on a multi-species dataset of 400- and 600-nucleotide sequences, SpliceRead consistently outperforms state-of-the-art models across all key metrics, including F1-score, accuracy, precision, and recall. Notably, it achieves a substantially lower non-canonical misclassification rate than baseline methods. Extensive evaluations, including cross-validation, cross-species testing, and input-length generalization, confirm its robustness and adaptability. SpliceRead offers a powerful, generalizable framework for splice site prediction, particularly in challenging, low-frequency sequence scenarios, and paves the way for more accurate gene annotation in both model and non-model organisms. The open-source code of SpliceRead and detailed documentation are available at https://github.com/OluwadareLab/SpliceRead.
Krishnan, N. M.; Rahman, S. I.; Olsen, L. R.; Panda, B.
Many biological studies could benefit from combining data from legacy microarray and high-throughput sequencing platforms, especially in clinical domains where collecting additional samples is not possible. However, incompatibility between platforms makes legacy data difficult to integrate, owing to differences in platform design, target preparation, and dependence on prior annotations. Here, we describe X Plat, a cross-platform data transformation tool for both expression and methylation assays that inter-converts data between microarray and sequencing platforms using per-gene second-degree polynomial regression. X Plat learns cross-platform conversion rules from paired microarray-sequencing datasets spanning multiple conditions, sample sources, organisms, and platforms, and evaluates performance using cross-validated root mean square error (RMSE) per gene. In rat, Arabidopsis, and human datasets, X Plat achieved lower cross-validated RMSE than TDM, HARMONY, and HARMONY2 for the vast majority of genes (at least 95% in all sequencing-to-array transformations and most array-to-sequencing transformations, with nearly 82% in the Arabidopsis array-to-sequencing setting), and these findings were confirmed using RMSE on held-out test samples from the first cross-validation fold. X Plat also achieved low RMSE (at most 0.2) for the majority of CpG regions in paired human array and sequencing methylation datasets. Using X Plat, users can transform data between microarray and high-throughput sequencing platforms, enabling cross-platform comparison and reuse of legacy cohorts.
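As a rough illustration of per-gene second-degree polynomial regression with cross-validated RMSE, the sketch below fits one gene at a time with NumPy. It is not the X Plat code: the function names `fit_gene` and `cv_rmse`, the five-fold split, and the fixed random seed are assumptions made for the example.

```python
import numpy as np


def fit_gene(array_values, seq_values):
    """Fit seq = a*x**2 + b*x + c for a single gene.

    Returns the coefficients [a, b, c], highest degree first.
    """
    return np.polyfit(array_values, seq_values, deg=2)


def cv_rmse(array_values, seq_values, n_folds=5, seed=0):
    """K-fold cross-validated RMSE of the quadratic fit for one gene."""
    x = np.asarray(array_values, dtype=float)
    y = np.asarray(seq_values, dtype=float)
    rng = np.random.default_rng(seed)
    # Shuffle sample indices, then split them into roughly equal folds.
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    squared_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        coeffs = np.polyfit(x[train], y[train], deg=2)
        predicted = np.polyval(coeffs, x[held_out])
        squared_errors.append((predicted - y[held_out]) ** 2)
    # Pool the held-out squared errors across folds before taking the root.
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))


if __name__ == "__main__":
    # Synthetic paired measurements for one hypothetical gene.
    array_signal = np.linspace(0.0, 10.0, 20)
    seq_signal = 0.5 * array_signal**2 - array_signal + 2.0
    print(fit_gene(array_signal, seq_signal))
    print(cv_rmse(array_signal, seq_signal))
```

Running the same two functions over every gene in a paired dataset would give one RMSE per gene, which is the unit of comparison the abstract reports.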
Caskey, M.; Rich, J.; Weber, R.; Mortazavi, A.; Pachter, L.; Hallgrimsdottir, I. B.
Single-cell genomics technologies enable high-throughput cell profiling, but technical contamination remains an obstacle to accurate downstream analysis. Free-floating ambient molecules released from lysed cells and global bulk contamination introduced during library preparation can distort molecular profiles. These artifacts can obscure cellular identities and reduce the reliability of differential analysis or clustering results. We present an efficient and effective approach to removing ambient and bulk contamination that can be applied to data generated from a wide variety of technologies. We show that our tool, CellSweep, outperforms other methods to remove artifacts using numerous benchmarks.
Tawfik, Y.; Diekmann, Y.; Orlando, L.; Burger, J.; Bloecher, J.
Age-at-death estimation of archaeological human remains is central to palaeodemographic research yet remains particularly challenging for adults where osteological methods often produce imprecise age ranges. Epigenetic clocks can accurately predict chronological age in modern humans, but their applicability to ancient human DNA is unclear due to data limitation and indirect methylation inference. Here, we evaluate the performance of existing epigenetic clocks on reconstructed ancient human methylomes combining high-coverage genomic data and a correction framework adapted to mitigate damage-derived sequence bias. Across multiple CpG window sizes, neither direct clock application nor regression-based retraining produced reliable continuous age-at-death estimates. Reframing age inference as adult-subadult classification did not return statistically supported age classes either. In contrast, sex estimation based on X-chromosome methylation achieved perfect accuracy, before and after correction. Together, these results indicate that current palaeo-epigenetic approaches reliably recover global biological signals but are not sufficiently sensitive to capture gradual, age-related variation in humans. Estimating age-at-death from ancient methylomes will therefore require methodological advances beyond correction alone, including reference data and improved models for inferring damage-derived epigenetic signals.
Mostafavi, S.; Tu, X.; Spiro, A.; Chikina, M.
Sequence-to-function (S2F) models trained on reference genomes have achieved strong performance on regulatory prediction and variant-effect benchmarks, yet they still struggle to predict inter-individual variation in gene expression from personal genomes. We evaluated AlphaGenome on personal genome prediction in two molecular modalities--gene expression and chromatin accessibility--and observed a striking dichotomy: AlphaGenome approaches the heritability ceiling for chromatin accessibility variation, but remains far below baseline for gene-expression variation, despite improving over Borzoi. Context truncation and fine-mapped QTL analyses indicate that accessibility is governed by local regulatory grammar captured by current architectures, whereas gene-expression variation requires long-range regulatory integration that remains challenging.
Dineen, L.; Wilson, D.; LaBella, A. L.
tRNAs are adapter molecules with an integral role in translation and further roles in stress adaptation. Processing of tRNA is tightly regulated and includes the enzymatic addition of several post-transcriptional modifications that are required for translation efficiency, recognition, selective translation, and structure. We currently lack a multi-species view of tRNA modifying enzymes across eukaryotes. Here, we performed a comparative analysis of tRNA gene sequence, modification enzymes, and modification profiles across the Saccharomycotina subphylum. We employed machine learning methods to explore tRNA sequence conservation and to annotate modifying enzymes known to exist in fungi, humans, and prokaryotes. We then applied Nano-tRNAseq to three species (Saccharomyces cerevisiae, Hanseniaspora uvarum, and Yarrowia lipolytica) to profile modification signatures and compare modification patterns. We identified substantial lineage-specific conservation of tRNA sequences despite the highly conserved tRNA structure. We found significant variation in tRNA modifying enzyme repertoires across Saccharomycotina, including lineage-specific losses, and annotated a prokaryotic-associated enzyme, tilS. Integrating genomic and sequencing data enabled us to link enzyme repertoires with tRNA gene sequences. tRNA sequencing revealed distinct modification signatures across the three focal species, and further analysis using generalized linear modelling suggested tRNA enzyme loss is associated with target tRNA nucleotide absence in gene sequences. This work provides the first integrated view of tRNA gene and modification diversity in eukaryotes and expands the field of tRNA diversity in fungi.
Liu, X.; Singh, R.; Ramachandran, S.
Clustering is widely used to identify cell types in cellular-resolution transcriptomic data, including single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST). Mixed-membership clustering assigns fractional memberships across clusters and captures continuous variation beyond hard clustering, but integrating and interpreting results from either approach is complicated by the "clustering alignment problem," which arises from label switching, multi-modality, and differences in model settings (including differing numbers of clusters). We introduce ACE-OF-Clust, enabling a four-step workflow for single-cell clustering: multiple clustering, clustering alignment, model comparison, and identification of informative features. ACE-OF-Clust introduces direct comparison of clustering solutions, assesses consistency against annotations, and leverages feature-level clustering profiles to prioritize genes discriminating among cell types. We demonstrate its utility on PBMC scRNA-seq and breast cancer ST data, and on multi-omic single-cell data. ACE-OF-Clust quantifies cross-omic clustering variability and suggests putative cross-omic regulatory links. Overall, ACE-OF-Clust increases the interpretability, flexibility, and robustness of single-cell clustering, providing a scalable tool for studying cellular heterogeneity and gene expression dynamics.
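The label-switching part of the clustering alignment problem can be shown with a small sketch. This is not the ACE-OF-Clust algorithm: it handles only hard clusterings with equal numbers of clusters, the function name `align_labels` is hypothetical, and it uses a brute-force search over permutations that is practical only for a handful of clusters.

```python
from itertools import permutations
from typing import Dict, Sequence


def align_labels(reference: Sequence[int], other: Sequence[int]) -> Dict[int, int]:
    """Map labels in `other` onto labels in `reference` to maximise agreement.

    Both inputs assign one cluster label per cell, in the same cell order.
    """
    if len(reference) != len(other):
        raise ValueError("both clusterings must cover the same cells")
    ref_labels = sorted(set(reference))
    other_labels = sorted(set(other))
    if len(ref_labels) != len(other_labels):
        raise ValueError("this sketch assumes equal numbers of clusters")
    best_mapping: Dict[int, int] = {}
    best_score = -1
    # Try every one-to-one relabelling and keep the one with most matches.
    for perm in permutations(ref_labels):
        mapping = dict(zip(other_labels, perm))
        score = sum(mapping[o] == r for r, o in zip(reference, other))
        if score > best_score:
            best_mapping, best_score = mapping, score
    return best_mapping


if __name__ == "__main__":
    run_a = [0, 0, 1, 1, 2, 2]
    run_b = [2, 2, 0, 0, 1, 1]  # same partition, different label names
    mapping = align_labels(run_a, run_b)
    print(mapping)
    print([mapping[label] for label in run_b])
```

For larger numbers of clusters, an assignment solver such as the Hungarian algorithm would replace the permutation search; differing cluster counts and mixed memberships, which the abstract also mentions, are outside the scope of this sketch.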
Tanner, R. M.; Perkins, T. J.
Histone modifications are a key component of the epigenetic state of a cell, and they vary widely across different cell and tissue types, conditions, and disease states. Indeed, the majority of the genome is enriched with one histone mark or another across the thousands of cellular conditions that have been studied to date. Here, we use the largest-to-date collection of histone modification ChIP-seq datasets to identify the most important sites of histone modifications genome-wide. Collected and uniformly reprocessed by the International Human Epigenome Consortium, this data includes 5339 datasets enriched at nearly one billion total peaks across 59 different major cell or tissue types and in healthy and disease conditions, for six different histone marks. We propose FindMetapeaks, a new approach to identifying histone mark metapeaks, which are genomic regions with enrichment of a mark across many samples. We show that many of these epigenetic metapeaks are strongly indicative of cell and tissue type, or are associated with other sample characteristics, and highlight key regulatory regions of the genome. However, we also show that many metapeaks contain redundant information, and that parsimonious subsets of metapeaks can be selected by machine learning to predict cell state. Our histone mark metapeak atlas provides a concise set of regions for interpreting the epigenome. Availability: https://github.com/rmbioinfo83/FindMetapeaks/
Siguret, C.; Olivier, M.; Huneau, C.; SOW, M. D.; Stenger, P.-L.; Klopp, C.; Martin, M.-L.; Tamby, J.-P.; Civan, P.; Pont, C.; Mathieu, O.; SALSE, J.
AGR (Ancestral Genome Reconstruction) is an automatic, publicly available, open-source pipeline that infers paleogenomes from comparisons of modern species genomes. It exploits hierarchical clustering of inter-species chromosomal synteny relationships and can be used to unveil how ancestral genomes, genes, sequences and functions have been shaped over millions of years of plant evolution up to the present day.
Zhang, H.; Kang, L.; Wang, J.; Liang, K. P.; Wang, Z.; Xu, K.; Zang, C.
Inference of transcriptional regulatory mechanisms from single-cell (sc) omics data, such as scRNA-seq, scATAC-seq, and scMultiome, remains an important problem in single-cell biology and functional genomics. Most existing methods for predicting functional transcriptional regulators (TRs) from single-cell data rely on co-expression between regulator and target genes and/or sequence motif enrichment, holding inherent limitations. Here, we present BARTsc, a computational method that accurately predicts functional TRs from clustered single-cell omics data by leveraging a large collection of public ChIP-seq profiles. BARTsc implements a novel framework to infer a cis-regulatory profile from differential genomic features from either unimodal (RNA or ATAC) or bimodal (scMultiome) single-cell profiling data and identify TRs whose binding profiles most associate with the cis-regulatory profile. BARTsc can quantify TR activity across cell clusters and predict key regulators for each cell cluster. We demonstrate that BARTsc can successfully identify active TRs in each cell type and cell-type-defining key regulators across diverse biological systems, including mouse cortex, human peripheral blood mononuclear cells (PBMCs), and human pancreatic ductal adenocarcinoma (PDAC). Using a generative-AI-assisted, literature-supported collection of cell-type key regulators as benchmarks, we show that BARTsc consistently outperforms existing state-of-the-art methods. We apply BARTsc to identify critical regulators in PDAC, including NEFLA, a novel PDAC key regulator, and validate its function in pancreatic tumor proliferation by experiments. As a robust and versatile computational method, BARTsc provides deeper insights into cell-type-specific regulatory programs, facilitating the discovery of key regulators across diverse biological systems.
Nolte, N. F.; Gruden, K.; Petek, M.
Motivation: Long-read RNA-seq and phased reference genomes enable haplotype-resolved gene and isoform expression analysis. While methods and tools exist for diploid organisms, analysis tools for polyploids are lacking. Results: We developed an end-to-end framework for allele-specific gene and isoform analysis in polyploids with three components: Syntelogfinder identifies syntenic genes in phased assemblies; longrnaseq quantifies transcripts, discovers novel isoforms, and performs quality control of long-read RNA-seq; and PolyASE analyzes differential allelic expression, differential isoform usage between conditions, and structural differences in major isoforms between haplotypes. We demonstrate the use of the framework on diploid rice and autotetraploid potato. Availability and Implementation: Syntelogfinder and longrnaseq are implemented in Nextflow and available on GitHub. PolyASE is a Python package available on PyPI. The framework is fully documented and tutorials are provided. Contact: Nadja.nolte.franziska@nib.si Supplementary information: Supplementary data are available online and on Zenodo.
Schroeder, L.; Gerber, S.; Ruffini, N.
Background: Ambient RNA contamination is a pervasive artifact of single-cell and single-nucleus RNA sequencing (sxRNA-seq), yet no consensus exists on which computational removal tool performs best across experimental platforms. Results: We present a systematic benchmark of six tools (CellBender, DecontX, SoupX, scCDC, scAR, and CellClear), evaluated across six human-mouse cell line mixing (hgmm) datasets (1k-20k cells) providing partial ground truth, two droplet-based complex tissue datasets (PBMC scRNA-seq; prefrontal cortex snRNA-seq), and a well-plate-based dataset (BD Rhapsody WBC). Using inter-species counts as partial ground truth, we quantify sensitivity, specificity, precision, and removal consistency per tool. We further apply a count-integrity criterion quantifying gene-cell positions where corrected values exceed raw counts. This reveals that scAR and CellClear do not merely denoise but fundamentally restructure count matrices: CellClear replaces >93% of counts with values derived from matrix factorization, while scAR generates spurious cell types absent from uncorrected data, including three spurious coarse cell types in the BD Rhapsody dataset and up to eight novel cell types in the prefrontal cortex. CellBender and SoupX exhibit reliable contamination removal with minimal count distortion. DecontX and scCDC are the only tools operable on non-droplet platforms without raw count matrix access. Runtime benchmarking at atlas scale (up to 172,000 nuclei) further demonstrates that CellClear fails to scale. Conclusions: Count matrix integrity, not removal sensitivity alone, must be a primary criterion when selecting ambient RNA correction tools. We provide platform-specific recommendations and a decision framework to guide tool selection across experimental contexts.
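The count-integrity criterion described in this abstract, counting gene-cell positions where the corrected value exceeds the raw count, can be sketched in a few lines. The function name `count_integrity` and the choice to report both a count and a fraction are assumptions for illustration; the benchmark's own implementation may differ.

```python
from typing import Sequence, Tuple


def count_integrity(
    raw: Sequence[Sequence[float]], corrected: Sequence[Sequence[float]]
) -> Tuple[int, float]:
    """Return (n_violations, fraction) of positions where corrected > raw.

    Both matrices are given as rows of equal shape (genes x cells or
    cells x genes; the orientation does not matter for this check).
    """
    if len(raw) != len(corrected):
        raise ValueError("matrices must have the same number of rows")
    violations = 0
    total = 0
    for raw_row, corrected_row in zip(raw, corrected):
        if len(raw_row) != len(corrected_row):
            raise ValueError("rows must have the same length")
        for raw_value, corrected_value in zip(raw_row, corrected_row):
            total += 1
            # Removing ambient RNA should only lower a count, never raise it.
            if corrected_value > raw_value:
                violations += 1
    fraction = violations / total if total else 0.0
    return violations, fraction


if __name__ == "__main__":
    raw_counts = [[5, 0, 2], [1, 3, 0]]
    corrected_counts = [[4, 0, 3], [1, 2, 1]]
    print(count_integrity(raw_counts, corrected_counts))
```

A tool that only subtracts contamination should score zero violations, so any non-zero fraction signals that the method is rewriting counts rather than merely removing ambient signal.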
KP, M. M.
Variant Call Format (VCF) files are the dominant interchange format for genomic variant data, but their size, routinely exceeding tens of gigabytes for population-scale studies, creates a significant computational bottleneck at the quality-filtering stage. Existing tools such as bcftools and vcftools provide broad functionality through general-purpose expression engines, but incur substantial per-record overhead from dynamic field lookup, type resolution, and heap allocation. We present vcfilt, a streaming, batch-parallel VCF filter implemented in Go that restricts its scope to three high-frequency filter criteria (INFO/DP, INFO/AF, and QUAL) and applies them via a zero-allocation byte-scan parser. Benchmarked on real 1000 Genomes Project data (chromosome 20, 1,811,146 variants), vcfilt achieves 147,000 variants/second on an 18 GB plain-text VCF file using a single thread, a 12.2x speedup over bcftools 1.18 under identical conditions. On gzip-compressed input, the speedup is 7.9x. Output is byte-for-byte identical to bcftools across all tested filter combinations. vcfilt is distributed as a self-contained static binary, a Docker image, and a Singularity-compatible container. The source code and all benchmark scripts are openly available under the MIT licence.
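To make the three filter criteria concrete, here is a minimal streaming sketch in Python. It is not vcfilt, which is written in Go and uses a byte-scan parser; the function names, the `>=` threshold semantics, the rule that records with missing values are dropped, and the use of the maximum AF for multi-allelic sites are all assumptions made for this example.

```python
from typing import Iterable, Iterator, Optional


def info_value(info: str, key: str) -> Optional[str]:
    """Return the value of `key` in a VCF INFO string, or None if absent."""
    for field in info.split(";"):
        name, sep, value = field.partition("=")
        if name == key and sep:
            return value
    return None


def passes(line: str, min_qual: float, min_dp: float, min_af: float) -> bool:
    """Check one data record against QUAL, INFO/DP and INFO/AF thresholds."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        return False
    try:
        qual = float(cols[5])
        depth = float(info_value(cols[7], "DP"))
        # Multi-allelic sites list one AF per ALT allele; take the largest.
        af = max(float(v) for v in info_value(cols[7], "AF").split(","))
    except (TypeError, ValueError, AttributeError):
        return False  # missing or non-numeric value: drop the record
    return qual >= min_qual and depth >= min_dp and af >= min_af


def filter_vcf(
    lines: Iterable[str], min_qual: float, min_dp: float, min_af: float
) -> Iterator[str]:
    """Yield header lines unchanged and only the data records that pass."""
    for line in lines:
        if line.startswith("#") or passes(line, min_qual, min_dp, min_af):
            yield line


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "20\t100\t.\tA\tG\t50\tPASS\tDP=30;AF=0.2\n",
        "20\t200\t.\tC\tT\t10\tPASS\tDP=30;AF=0.2\n",
    ]
    for kept in filter_vcf(example, min_qual=30, min_dp=10, min_af=0.05):
        print(kept, end="")
```

Because `filter_vcf` is a generator over an iterable of lines, it can be fed an open file handle and never holds more than one record in memory, which is the streaming property the abstract emphasises.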